Probabilistic data association for simultaneous localization and mapping

ABSTRACT

A method for simultaneous location and mapping (SLAM) includes receiving, by at least one processor, a set of sensor measurements from a movement sensor of a mobile robot and a set of images captured by a camera on the mobile robot as the mobile robot traverses an environment. The method includes, for each image of at least a subset of the set of images, extracting, by the at least one processor, a plurality of detected objects from the image. The method includes estimating, by the at least one processor, a trajectory of the mobile robot and a respective semantic label and position of each detected object within the environment using the sensor measurements and an expectation maximization (EM) algorithm.

PRIORITY CLAIM

This application claims the benefit of U.S. Provisional PatentApplication Ser. No. 62/616,990, filed Jan. 12, 2018, the disclosure ofwhich is incorporated herein by reference in its entirety.

GOVERNMENT INTEREST

This invention was made with government support under Grant Numbers ARLMAST-CTA W911NF-08-2-0004 and ARL RCTA W911NF-10-2-0016 awarded by theArmy Research Lab. The government has certain rights in the invention.

TECHNICAL FIELD

This specification relates generally to mobile robots and computersystems for simultaneous localization and mapping.

BACKGROUND

Traditional approaches to simultaneous localization and mapping (SLAM)rely on low-level geometric features such as points, lines, and planes.They are unable to assign semantic labels to landmarks observed in theenvironment. Loop closure recognition based on low-level features isoften viewpoint-dependent and subject to failure in ambiguous orrepetitive environments. Object recognition methods can infer landmarkclasses and scales, resulting in a small set of easily recognizablelandmarks, ideal for view-independent unambiguous loop closure. In a mapwith several objects of the same class, however, a crucial dataassociation problem exists—an object observation needs to be associatedto the correct object among several objects of the same class. Whiledata association and recognition are discrete problems usually solvedusing discrete inference, classical SLAM is a continuous optimizationover metric information.

SUMMARY

This specification formulates an optimization problem over sensor statesand semantic landmark positions that integrates metric information,semantic information, and data associations, and decomposes it into twointerconnected optimization problems: an estimation of discrete dataassociation and landmark class probabilities, and a continuousoptimization over the metric states (e.g., positions and orientations).The estimated landmark and robot poses affect the association and classdistributions, which in turn affect the robot-landmark poseoptimization. The performance of the algorithm is demonstrated on indoorand outdoor datasets of RGB camera images and accelerometer andgyroscope measurements.

The subject matter described herein may be implemented in hardware,software, firmware, or any combination thereof. As such, the terms“function” or “node” as used herein refer to hardware, which may alsoinclude software and/or firmware components, for implementing thefeature(s) being described. In some exemplary implementations, thesubject matter described herein may be implemented using a computerreadable medium having stored thereon computer executable instructionsthat when executed by the processor of a computer control the computerto perform steps. Exemplary computer readable media suitable forimplementing the subject matter described herein include non-transitorycomputer readable media, such as disk memory devices, chip memorydevices, programmable logic devices, and application specific integratedcircuits. In addition, a computer readable medium that implements thesubject matter described herein may be located on a single device orcomputing platform or may be distributed across multiple devices orcomputing platforms.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow diagram of an example method for semantic SLAM;

FIG. 2 shows an example keyframe image overlaid with ORB features andobject detections;

FIG. 3 shows an estimated sensor trajectory and landmark positions andclasses using inertial, geometric, and semantic measurements;

FIG. 4 shows a sensor trajectory and estimated landmarks for the firstoffice experiment;

FIG. 5 shows estimated trajectories in the first office experiment;

FIG. 6 shows an estimated trajectory in the second office experimentfrom our algorithm along with our estimated door landmark positions,overlaid onto a partial ground truth map along with ground truth doorlocations;

FIG. 7 shows a partial ORB-SLAM2 trajectory after incorrect loop closurein second office experiment;

FIG. 8 shows a sensor trajectory and estimated landmarks for the viconexperiment;

FIG. 9 shows position errors with respect to vicon ground truth;

FIG. 10 is a chart of the norm of position error between estimate andground truth, KITTI seq. 05;

FIG. 11 is a chart of the norm of position error between estimate andground truth, KITTI seq. 06; and

FIG. 12 shows two tables specifying KITTI mean translational androtational error over path lengths (100, 200, . . . 800) meters.

DETAILED DESCRIPTION

This specification addresses the metric and semantic SLAM problemsjointly, taking advantage of object recognition to tightly integrateboth metric and semantic information into the sensor state and mapestimation. In addition to providing a meaningful interpretation of thescene, semantically-labeled landmarks address two critical issues ofgeometric SLAM: data association (matching sensor observations to maplandmarks) and loop closure (recognizing previously-visited locations).This specification provides the following:

-   -   an approach to tightly couple inertial, geometric, and semantic        observations into a single optimization framework,    -   a formal decomposition of the joint metric-semantic SLAM problem        into continuous (pose) and discrete (data association and        semantic label) optimization sub-problems,    -   experiments on several long-trajectory real indoor and outdoor        datasets, which include odometry and visual measurements in        cluttered scenes and varying lighting conditions.

This specification describes methods, systems, computer readablemediums, and mobile robots for semantic SLAM.

FIG. 1 is a flow diagram of an example method 100 for semantic SLAM. Themethod 100 can be performed by a computer system including at least oneprocessor and memory storing instructions for the processor. Forexample, the method 100 can be performed by an on-board computer systemof a mobile robot with a camera and a movement sensor or by a remotecomputer system remote from the mobile robot.

The method 100 includes receiving a set of sensor measurements from amovement sensor of a mobile robot (102). The method includes receiving aset of images captured by a camera on the mobile robot as the mobilerobot traverses an environment (104). For example, receiving the set ofsensor measurements from the movement sensor can include receivingaccelerometer and gyroscope measurements from an inertial measurementunit of the mobile robot, and receiving the set of sensor measurementsand the set of images can include time synchronizing the sensormeasurements and the images.

The method 100 includes, for each image of at least a subset of the setof images, extracting detected objects from the image (106). The method100 includes estimating a trajectory of the mobile robot and arespective semantic label and position of each detected object withinthe environment using the sensor measurements and an expectationmaximization (EM) algorithm (108). An example of semantic SLAM using anEM algorithm is described further below.

For example, using the EM algorithm can include iteratively solving fora data association distribution for a plurality of data associationsbetween the detected objects and the semantic labels and positions ofthe detected objects, e.g., using a matrix permanent algorithm, in anexpectation step of the EM algorithm. Using the EM algorithm can includeiteratively solving for respective mobile robot positions and semanticlabels and positions of detected objects using the data associationdistribution in a maximization step of the EM algorithm. Using the EMalgorithm can include iterating between the expectation step and themaximization step until an end condition is reached.

Estimating the trajectory of the mobile robot and the semantic label andposition of each detected object can include, for each image of thesubset of the set of images, extracting geometric point features. Themethod 100 can include tracking the geometric point features across theset of images.

Estimating the trajectory of the mobile robot and the semantic label andposition of each detected object can include constructing a pose graphincluding vertices for mobile robot poses and vertices for detectedobject positions. Estimating the trajectory of the mobile robot and thesemantic label and position of each detected object can includedetermining soft data associations between mobile robot poses anddetected object positions using, for each soft data association, anumber of different factors, e.g., semantic factors, geometric factors,and movement sensor factors.

Further examples of semantic SLAM are provided below. Observed objectsin the environment are generally in one of two categories: (1) objectsalready present in our constructed map that are re-observed, or (2)objects that have been previously unseen and must be added to ourestimated map. The examples in the first section below describe handlingobjects in class (1) in a probabilistic EM way, and the examples in thesecond section below include a method to include both categories (1) and(2) probabilistically.

Probabilistic Data Association for Semantic SLAM—Section I

The following section describes a study performed regarding semanticSLAM. The study illustrates examples of semantic SLAM.

Traditional approaches to simultaneous localization and mapping (SLAM)rely on low-level geometric features such as points, lines, and planes.They are unable to assign semantic labels to landmarks observed in theenvironment. Furthermore, loop closure recognition based on low-levelfeatures is often viewpoint-dependent and subject to failure inambiguous or repetitive environments. On the other hand, objectrecognition methods can infer landmark classes and scales, resulting ina small set of easily recognizable landmarks, ideal for view-independentunambiguous loop closure. In a map with several objects of the sameclass, however, a crucial data association problem exists. While dataassociation and recognition are discrete problems usually solved usingdiscrete inference, classical SLAM is a continuous optimization overmetric information. In this paper, we formulate an optimization problemover sensor states and semantic landmark positions that integratesmetric information, semantic information, and data associations, anddecompose it into two interconnected problems: an estimation of discretedata association and landmark class probabilities, and a continuousoptimization over the metric states. The estimated landmark and robotposes affect the association and class distributions, which in turnaffect the robot-landmark pose optimization. The performance of ouralgorithm is demonstrated on indoor and outdoor datasets.

I. Introduction

In robotics, simultaneous localization and mapping (SLAM) is the problemof mapping an unknown environment while estimating a robot's pose withinit. Reliable navigation, object manipulation, autonomous surveillance,and many other tasks require accurate knowledge of the robot's pose andthe surrounding environment. Traditional approaches to SLAM rely onlow-level geometric features such as corners [1], lines [2], and surfacepatches [3] to reconstruct the metric 3-D structure of a scene but aremostly unable to infer semantic content. On the other hand, recentmethods for object recognition [4]-[6] can be combined with approximate3D reconstruction of the environmental layout from single frames usingpriors [7], [8]. These are rather qualitative single 3D snapshots ratherthan the more precise mapping we need for a robot to navigate. The goalof this paper is to address the metric and semantic SLAM problemsjointly, taking advantage of object recognition to tightly integrateboth metric and semantic information into the sensor state and mapestimation. In addition to providing a meaningful interpretation of thescene, semantically-labeled landmarks address two critical issues ofgeometric SLAM: data association (matching sensor observations to maplandmarks) and loop closure (recognizing previously-visited locations).

Approaches to SLAM were initially most often based on filtering methodsin which only the most recent robot pose is estimated [9]. This approachis in general very computationally efficient, however because of theinability to estimate past poses and relinearize previous measurementfunctions, errors can compound [1]. More recently, batch methods thatoptimize over entire trajectories have gained popularity. Successfulbatch methods typically represent optimization variables as a set ofnodes in a graph (a “pose graph”). Two robot-pose nodes share an edge ifan odometry measurement is available between them, while a landmark anda robot-pose node share an edge if the landmark was observed from thecorresponding robot pose. This pose graph optimization formulation ofSLAM traces back to Lu and Milios [10]. In recent years, the state ofthe art [11], [12] consists of iterative optimization methods (e.g.,nonlinear least squares via the Gauss-Newton algorithm) that achieveexcellent performance but depend heavily on linearization of the sensingand motion models. This becomes a problem when we consider includingdiscrete observations, such as detected object classes, in the sensingmodel.

One of the first systems that used both spatial and semanticrepresentations was proposed by Galindo et al. [13]. A spatial hierarchycontained camera images, local metric maps, and the environmenttopology, while a semantic hierarchy represented concepts and relations,which allowed room categories to be inferred based on object detections.Many other approaches [14]-[19] extract both metric and semanticinformation but typically the two processes are carried out separatelyand the results are merged afterwards. The lack of integration betweenthe metric and the semantic mapping does not allow the object detectionconfidence to influence the performance of the metric optimization.Focusing on the localization problem only, Atanasov et al. [20]incorporated semantic observations in the metric optimization via aset-based Bayes filter. The works that are closest to ours [21]-[24]consider both localization and mapping and carry out metric and semanticmapping jointly. SLAM++ [22] focuses on a realtime implementation ofjoint 3-D object recognition and RGBD SLAM via pose graph optimization.A global optimization for 3D reconstruction and semantic parsing hasbeen proposed by [25], which is the closest work in semantic/geometricjoint optmization. The main difference is that 3D space is voxelized andlandmarks and/or semantic labels are assigned to voxels which areconnected in a conditional random field while our approach allows theestimation of continuous pose of objects. Bao et al. [21] incorporatecamera parameters, object geometry, and object classes into a structurefrom motion problem, resulting in a detailed and accurate but large andexpensive optimization. A recent comprehensive survey of semanticmapping can be found in [26].

Most related work uses a somewhat arbitrary decomposition between dataassociation, pose graph optimization, and object recognition. Our workmakes the following contributions to the state of the art:

-   -   our approach is the first to tightly couple inertial, geometric,        and semantic observations into a single optimization framework,    -   we provide a formal decomposition of the joint metricsemantic        SLAM problem into continuous (pose) and discrete (data        association and semantic label) optimization sub-problems,    -   we carry out experiments on several long-trajectory real indoor        and outdoor datasets, which include odometry and visual        measurements in cluttered scenes and varying lighting        conditions.

II. Probabilistic Data Association in SLAM

Consider the classical localization and mapping problem, in which amobile sensor moves through an unknown environment, modeled as acollection

{

_(m)}_(m=1) ^(M) of M static landmarks. Given a set of sensormeasurements

{z_(k)}_(k=1) ^(K), the task is to estimate the landmark positions

and a sequence of poses X

{x_(t)}_(t=1) ^(T) representing the sensor trajectory. Most existingwork focuses on estimating X and

and rarely emphasizes that the data association

{(α_(k), β_(k))}_(k=1) ^(K) stipulating that measurement z_(k) oflandmark

_(β) _(k) was obtained from sensor state x_(α) _(k) is in fact unknown.A complete statement of the SLAM problem involves maximum likelihoodestimation of X,

, and

given the measurements

:

$\begin{matrix}{\hat{},\hat{\mathcal{L}},{\hat{} = {\underset{,\mathcal{L},}{{\arg {\mspace{14mu} \;}\max}\;}\log \; {p\left( {\left.  \middle|  \right.,\mathcal{L},} \right)}}}} & (1)\end{matrix}$

The most common approach to this maximization has been to decompose itinto two separate estimation problems. First, given prior estimates X⁰and

⁰, the maximum likelihood estimate

of the data association

is computed (e.g., via joint compatability branch and bound [27] or theHungarian algorithm [28]). Then, given

, the most likely landmark and sensor states are estimated:

$\begin{matrix}{\hat{} = {\underset{}{{\arg {\mspace{14mu} \;}\max}\;}{p\left( {\left.  \middle| ^{0} \right.,\mathcal{L}^{0},} \right)}}} & \left( {2a} \right) \\{\hat{},{\hat{\mathcal{L}} = {\underset{,\mathcal{L}}{{\arg {\mspace{14mu} \;}\max}\;}\log \; {p\left( {\left.  \middle|  \right.,\mathcal{L},\hat{}} \right)}}}} & \left( {2b} \right)\end{matrix}$

The second optimization above is typically carried out via filtering[30]-[32] or pose-graph optimization [11], [12].

The above process has the disadvantage that an incorrectly chosen dataassociation may have a highly detrimental effect on the estimationperformance. Moreover, if ambiguous measurements are discarded to avoidincorrect association choices, they will never be reconsidered laterwhen refined estimates of the sensor pose (and hence their dataassociation) are available. Instead of a simple one step process, then,it is possible to perform coordinate descent, which iterates the twomaximization steps as follows:

$\begin{matrix}{^{i + 1} = {\underset{}{{\arg {\mspace{14mu} \;}\max}\;}{p\left( {\left.  \middle| ^{i} \right.,\mathcal{L}^{i},} \right)}}} & \left( {3a} \right) \\{^{i + 1},{\mathcal{L}^{i + 1} = {\underset{,\mathcal{L}}{{\arg {\mspace{14mu} \;}\max}\;}\log \; {p\left( {\left.  \middle|  \right.,\mathcal{L},^{i + 1}} \right)}}}} & \left( {3b} \right)\end{matrix}$

This resolves the problem of being able to revisit association decisionsonce state estimates improve but does little to resolve the problem withambiguous measurements since a hard decision on data associations isstill required. To address this, rather than simply selecting

as the mode of p(

|X,

,

), we should consider the entire density of

when estimating X and

. Given initial estimates Xi, Li, an improved estimate that utilizes thewhole density of

can be computed by maximizing the expected measurement likelihood viaexpectation maximization (EM):

$\begin{matrix}\begin{matrix}{^{i + 1},{\mathcal{L}^{i + 1} = {\underset{,\mathcal{L}}{{\arg {\mspace{14mu} \;}\max}\;}{_{}\left\lbrack {\left. {\log \; {p\left( {\left.  \middle|  \right.,\mathcal{L},} \right)}} \middle| ^{i} \right.,\mathcal{L}^{i},} \right\rbrack}}}} \\{= {\underset{,\mathcal{L}}{{\arg {\mspace{14mu} \;}\max}\;}{\sum\limits_{ \in }\; {{p\left( {\left.  \middle| ^{i} \right.,\mathcal{L}^{i},} \right)}\log \; {p\left( {\left.  \middle|  \right.,\mathcal{L},} \right)}}}}}\end{matrix} & (4)\end{matrix}$

where

is the space of all possible values of

. This EM formulation has the advantage that no hard decisions on dataassociation are required since it “averages” over all possibleassociations. To compare this with the coordinate descent formulation in(3), we can rewrite (4) as follows:

$\begin{matrix}{{\underset{,\mathcal{L}}{{\arg {\mspace{14mu} \;}\max}\;}{\sum\limits_{ \in }{\sum\limits_{k = 1}^{K}{{p\left( {\left.  \middle| ^{i} \right.,\mathcal{L}^{i},} \right)}\log \; {p\left( {\left. z_{k} \middle| x_{\alpha_{k}} \right.,_{\beta_{k}}} \right)}}}}} = {\underset{,\mathcal{L}}{{\arg {\mspace{14mu} \;}\max}\;}{\sum\limits_{k = 1}^{K}{\sum\limits_{j = 1}^{M}{w_{kj}^{i}\log \; {p\left( {\left. z_{k} \middle| x_{\alpha_{k}} \right.,_{j}} \right)}}}}}} & (5)\end{matrix}$

where w_(kj) ^(i)

Σ

_(∈)

_((k,j))p(

|X^(i),

^(i),

) is a weight, independent of the optimization variables X and

, that quantifies the influence of the “soft” data association, and

(k,j)

{

∈

|β_(k)=j}⊆

is the set of all data associations such that measurement k is assignedto landmark j. Note that the coordinate descent optimization (3b) has asimilar form to (5), except that for each k there is exactly one j suchthat w_(kj) ^(i)=1 and w_(kl) ^(i)=0 for all l≠j.

We can also show that the EM formulation, besides being a generalizationof coordinate descent, is equivalent to the following matrix permanentmaximization problem.

Proposition 1.

If p(

|X^(i),

^(i)) is uniform, the maximizers of the EM formulation in (4) and theoptimization below are equal:

$^{i + 1},{\mathcal{L}^{i + 1} = {\underset{,\mathcal{L}}{{\arg {\mspace{14mu} \;}\max}\;}\mspace{14mu} {{per}\left( {Q^{i}\left( {,\mathcal{L}} \right)} \right)}}},$

where per denotes the matrix permanent², Q^(i)(X,

) is a matrix with elements [Q^(i)]_(kj):=p(z_(k)|x_(j) ^(i),

_(j) ^(i))p(z_(k)|x_(j),

_(j)) and {(x_(j) ^(i),

_(j) ^(i))} and {(x_(j),

_(j))} are enumerations of the sets X^(i)×

^(i) and X×

, respectively.Proof. See Appendix I.

Similar to the coordinate descent formulation, the EM formulation (5)allows us to solve the permanent maximization problem iteratively.First, instead of estimating a maximum likelihood data association, weestimate the data association distribution p(D|X^(i),

^(i),

) in the form of the weights w_(kj) ^(i) (the “E” step). Then, wemaximize the expected measurement log likelihood over the previouslycomputed distribution (the “M” step).

III. Semantic SLAM

In the rest of the paper, we focus on a particular formulation of theSLAM problem that in addition to sensor and landmark poses involveslandmark classes (e.g., door, chair, table) and semantic measurements inthe form of object detections. We will demonstrate that the expectationmaximization formulation (5) is an effective way to solve the semanticSLAM problem.

Let the state

of each landmark consist of its position

^(p) ∈

³ as well as a class label

^(c) from a discrete set

={1, . . . , C}. To estimate the landmark states

and sensor trajectory X, we utilize three sources of information:inertial, geometric point features, and semantic object observations.Examples of geometric features and semantic observations can be seen inFIG. 2.

A. Inertial information

We assume that the sensor package consists of an inertial measurementunit (IMU) and one monocular camera. A subset of the images captured bythe camera are chosen as keyframes (e.g., by selecting every nth frameas a keyframe). The sensor state corresponding to the tth keyframe isdenoted x_(t) and consists of the sensor 6-D pose, velocity, and IMUbias values. We assume that the IMU and camera are time synchronized, sobetween keyframes t and t+1, the sensor also collects a set

_(t) of IMU measurements (linear acceleration and rotational velocity).

B. Geometric Information

In addition to the inertial measurements

_(t), we utilize geometric point measurements (e.g., Harris corners,SIFT, SURF, FAST, BRISK, ORB, etc.)

_(t). From each keyframe image, these geometric point features areextracted and tracked forward to the subsequent keyframe. In ourexperiments we extract ORB features [33] from each keyframe and matchthem to the subsequent keyframe by minimizing the ORB descriptordistance. Since these features are matched by an external method, weassume that their data association is known.

C. Semantic Information

The last type of measurement used are object detections S_(t) extractedfrom every keyframe image. An object detection s_(k)=(s_(k) ^(c), s_(k)^(s), s_(k) ^(b)) ∈ S_(t) extracted from keyframe t consists of adetected class s_(k) ^(c) ∈

, a score s_(k) ^(s) quantifying the detection confidence, and abounding box s_(k) ^(s). Such information can be obtained from anymodern approach for object recognition such as [5], [34]-[36]. In ourimplementation, we use a deformable parts model (DPM) detector [4],[37], [38], which runs on a CPU in real time. If the data association

_(k)=(α_(k), β_(k)) of measurement s_(k) is known, the measurementlikelihood can be decomposed as follows: p(s_(k)|x_(α) _(k) ,

_(β) _(k) )=p(s_(k) ^(c)|

_(β) _(k) ^(c))p(s_(k) ^(s)|

_(β) _(k) ^(c), s_(k) ^(c))p(s_(k) ^(b)|x_(α) _(k) ,

_(β) _(k) ^(p)). The density p(s_(k) ^(c)|

_(β) _(k) ^(c)) corresponds to the confusion matrix of the objectdetector and is learned offline along with the score distributionp(s_(k) ^(s)|

_(β) _(k) ^(c), s_(k) ^(c)). The bounding-box likelihood p(s_(k)^(b)|x_(α) _(k) ,

_(β) _(k) ^(p)) is assumed normally distributed with mean equal to theperspective projection of the centroid of the object onto the imageplane and covariance proportional to the dimensions of the detectedbounding box.

-   Problem (Semantic SLAM). Given inertial    {    _(t)}_(t=1) ^(T), geometric    {    _(t)}_(t=1) ^(T), and semantic S    {S_(t)}_(t=1) ^(T) measurements, estimate the sensor state    trajectory X and the positions and classes    of the objects in the environment. The inertial and geometric    measurements are used to track the sensor trajectory locally and,    similar to a visual odometry approach, the geometric structure is    not recovered. The semantic measurements, in contrast, are used to    construct a map of objects that can be used to perform loop closure    that is robust to ambiguities and viewpoint and is more efficient    than a SLAM approach that maintains full geometric structure.

IV. Semantic SLAM Using EM

Following the observations from Sec. II, we apply expectationmaximization to robustly handle the semantic data association. Inaddition to treating data association as a latent variable, we alsotreat the discrete landmark class labels as latent variables in theoptimization, resulting in a clean and efficient separation betweendiscrete and continuous variables. As mentioned in Sec. III, the dataassociation of the geometric measurements is provided by the featuretracking algorithm, so the latent variables we use are the dataassociation of the semantic measurements and the object classes

_(1:M) ^(c). The following proposition specifies the EM steps necessaryto solve the semantic SLAM problem. The initial guess X⁽⁰⁾ is providedby odometry integration; the initial guess

⁽⁰⁾ can be obtained from X⁽⁰⁾ by initializing a landmark along thedetected camera ray.

-   Proposition 2. If p(    |X,    ) is uniform and the semantic measurement data associations are    independent across keyframes, i.e., p(    |S, X,    )=Π_(t=1) ^(T)p(    _(t)|S_(t), X,    ), the semantic SLAM problem can be solved via the expectation    maximization algorithm by iteratively solving for (1) data    association weights w_(ij) ^(t) (the “E” step) and (2) continuous    sensor states x and landmark positions l_(1:M) ^(p) (the “M” step)    via the following equations:

$\begin{matrix}{{w_{kj}^{t,{(i)}} = {\sum\limits_{^{c} \in }{\sum\limits_{_{t} \in {_{t}{({k,j})}}}{{\kappa^{(i)}\left( {_{t},^{c}} \right)}{\forall t}}}}},k,j} & (6) \\{^{({i + 1})},{_{1:M}^{p,{({i + 1})}} = {\underset{,_{1:M}^{P}}{\arg \mspace{14mu} \min}{\sum\limits_{t = 1}^{T}{\sum\limits_{s_{k} \in _{t}}^{\;}{\sum\limits_{j = 1}^{M}{{- w_{kj}^{t,{(i)}}}\log \; {p\left( {\left. s_{k} \middle| x_{t} \right.,_{j}} \right)}}}}}}}} & (7)\end{matrix}$

_(t) is the set of all possible data associations for measurementsreceived at timestep t, and

_(t)(i, j)⊆

_(t) is the set of all possible data associations for measurementsreceived at time t such that measurement i is assigned to landmark j.Proof. See Appendix II.

A. Object Classes and Data Association (E Step)

The computation of the weights for a single keyframe require severalcombinatorial sums over all possible data associations. However, due tothe assumption of independent associations among keyframes and the factthat only few objects are present within the sensor field-of-view, it isfeasible to compute the summations and hence w_(kj) ^(t) for allkeyframes t, measurements k, and landmarks j extremely efficiently inpractice. Once the weights w_(kj) ^(t,(i)) are computed for eachmeasurement-landmark pair, they are used within the continuousoptimization over sensor states and landmark positions. Additionally,maximum likelihood landmark class estimates

^(c) can be recovered from the computed k values:

${\hat{}}_{1:M}^{c} = {{\underset{^{c}}{\arg \mspace{14mu} \max}{p\left( {\left. _{1:M}^{c} \middle| \theta \right.,} \right)}} = {\underset{^{c}}{\arg \mspace{14mu} \max}{\prod\limits_{t = 1}^{T}{\sum\limits_{_{t} \in _{t}}{\kappa \left( {_{t},^{c}} \right)}}}}}$

B. Pose Graph Optimization (M Step)

Equation (7) forms the basis of our pose graph optimization over sensorstates and landmark positions. A pose graph is a convenient way ofrepresenting an optimization problem for which there exists a clearphysical structure or a sparse constraint set. The graph consists of aset of vertices v, each of which corresponds to an optimizationvariable, and a set of factors

among the vertices that correspond to individual components of the costfunction. Graphically, a factor is a generalization of an edge thatallows connectivity between more than two vertices. A factor f in thegraph is associated with a cost function that depends on a subset of thevariables v such that the entire optimization is of the form

$\begin{matrix}{\hat{} = {\underset{}{\arg \mspace{14mu} \min}{\sum\limits_{f \in \mathcal{F}}^{\;}{f()}}}} & (8)\end{matrix}$

In addition to providing a useful representation, factor graphs areadvantageous in that there exist computational tools that allowefficient optimization [11], [39]. Our graph has a vertex for eachsensor state x_(t) and for each landmark position

_(i) ^(p). Contrary to most prior work in which a hard data associationdecision results in a measurement defining a single factor between asensor pose and a landmark, we consider soft semantic data associationmultiple factors.

1) Semantic Factors: A measurement s_(k) from sensor state x_(i) definesfactors f_(kj) ^(s)(x_(i),

_(j)) for each visible landmark j. Assuming the number of visiblelandmarks and the number of received measurements are approximatelyequal, with this method the number of semantic factors in the graph isroughly squared. Note that since

^(c) is fixed in (7), p(s^(s)|

^(c), s^(c)) and p(s^(c)|

^(c)) are constant. Thus, log p(s|x,

)=log p(s^(b)|x,

^(p))+log p(s^(s)|

^(c), s^(c))p(s^(c)|

^(c)) and so the latter term can be dropped from the optimization.

Let h_(π)(x,

^(p)) be the standard perspective projection of a landmark

^(p) onto a camera at pose x. We assume that the camera measurement of alandmark

^(p) from camera pose x is Gaussian distributed with mean h_(π)(x,

^(p)) and covariance R_(s). Thus, a camera factor corresponding tosensor state t, measurement k, and landmark j, f_(kj) ^(s), becomes

$\begin{matrix}{{f_{kj}^{s}\left( {,\mathcal{L}} \right)} = {{- w_{kj}^{t,{(i)}}}\log \; {p\left( {\left. s_{k}^{b} \middle| x_{t} \right.,_{j}^{p}} \right)}}} & (9) \\{= {{s_{k}^{b} - {h_{\pi}\left( {x_{t},_{j}} \right)}}}_{R_{s}/w_{kj}^{t,{(i)}}}^{2}} & (10)\end{matrix}$

Those semantic factors due to the re-observation of a previously seenlandmark are our method's source of loop closure constraints.

2) Geometric Factors: Following [30], [40], we incorporate geometricmeasurements into the pose graph as structureless constraints betweenthe camera poses that observed them. We can rewrite the termcorresponding to geometric factors in (7) as

$\begin{matrix}{{{- \log}\; {p\left(  \middle|  \right)}} = {- {\sum\limits_{i = 1}^{N_{y}}{\sum\limits_{{k\text{:}\beta_{k}^{y}} = i}^{\;}{\log \; {p\left( y_{k} \middle| x_{\alpha_{k}^{y}} \right)}}}}}} & (11)\end{matrix}$

where N_(y) is the total number of distinct feature tracks, i.e. thetotal number of observed physical geometric landmarks.

Letting ρ_(β) _(k) _(y) be the 3D position in the global frame of thelandmark that generated measurement y_(k), and assuming as before thatthe projection has Gaussian pixel noise with covariance R_(y), we have

$\begin{matrix}{{{- \log}\; {p\left(  \middle|  \right)}} = {- {\sum\limits_{i = 1}^{N_{y}}{\sum\limits_{{k\text{:}\beta_{k}^{y}} = i}^{\;}{{y_{k} - {h_{\pi}\left( {x_{\alpha_{k}^{y}},\rho_{i}} \right)}}}_{R_{y}}^{2}}}}} & (12)\end{matrix}$

For a single observed landmark ρ_(i), the factor constraining the cameraposes which observed it takes the form

$\begin{matrix}{{f_{i}^{y}()} = {\sum\limits_{{k:\beta_{k}^{y}} = i}{{y_{k} - {h_{\pi}\left( {x_{\alpha_{k}^{y}},\rho_{i}} \right)}}}_{R_{y}}^{2}}} & (13)\end{matrix}$

Because we use iterative methods to optimize the full pose graph, it isnecessary to linearize the above cost term. The linearization of theabove results in a cost term of the form

$\begin{matrix}{\sum\limits_{{k:\beta_{k}^{y}} = i}{{{H_{ik}^{\rho}{\delta\rho}_{i}} + {H_{ik}^{x}\delta \; x_{\alpha_{k}^{y}}} + b_{ik}}}^{2}} & (14)\end{matrix}$

where H_(ik) ^(ρ) is the Jacobian of the cost function with respect toρ_(β) _(k) _(y) , H_(ik) ^(x) is the Jacobian with respect to x_(α) _(k)_(y) , b_(ik) is a function of the measurement and its error, and thelinearized cost term is in terms of deltas δx, δρ rather than the truevalues x, ρ.

Writing the inner summation in one matrix form by stacking theindividual components, we can write this simply as ∥H_(i)^(ρ)δρ_(i)+H_(i) ^(x)δx_(α) _(y) _((i))+b_(i)∥². To avoid optimizingover ρ values, and hence to remove the dependence of the cost functionupon them, we project the cost into the null space of its Jacobian. Wepremultiply each cost term by A_(i), a matrix whose columns span theleft nullspace of H_(i) ^(ρ). The cost term for the structurelessgeometric features thus becomes a function of only the states whichobserve it:

∥A_(i)H_(i) ^(x)δx_(α) _(y) _((i))+A_(i)b_(i)∥²   (15)

3) Inertial Factors: To incorporate the accelerometer and gyroscopemeasurements into the pose graph, we use the method of preintegrationfactors detailed in [40]. The authors provide an efficient method ofcomputing inertial residuals between two keyframes x_(i) and x_(j) inwhich several inertial measurements were received. By “preintegrating”all IMU measurements received between the two keyframes, the relativepose difference (i.e. difference in position, velocity, and orientation)between the two successive keyframes is estimated. Using this estimatedrelative pose, the authors provide expressions for inertial residuals onthe rotation (r_(ΔR) _(ij) ), velocity (r_(Δv) _(ij) ), and position(r_(Δp) _(ij) ) differences between two keyframes as a function of theposes x_(i) and x_(j). Specifically, they provide said expressions alongwith their noise covariances Σ such that

$\begin{matrix}{{f_{i}^{\mathcal{I}}()} = {{- \log}\mspace{11mu} {p\left( {\mathcal{I}_{ij}} \right)}}} & {{~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~}(16)} \\{= {r_{\mathcal{I}_{ij}}}_{\overset{2}{\sum\limits_{ij}}}} & {(17)}\end{matrix}$

The full pose graph optimization corresponding to equation (7) is then anonlinear least squares problem involving semantic observation terms(see (10)), geometric observation terms (see (15)), and inertial terms(see (17)).

$\begin{matrix}{{\hat{x}}_{1\text{:}T},{{\hat{}}_{1\text{:}M} = {{\underset{,_{1:M}}{\arg \mspace{14mu} \min}{\sum\limits_{k = 1}^{K}{\sum\limits_{j = 1}^{M}{f_{kj}^{s}\left( {,_{1\text{:}M}^{p}} \right)}}}} + {\sum\limits_{i = 1}^{N_{y}}{f_{i}^{y}()}} + {\sum\limits_{t = 1}^{T - 1}{f_{t}^{\mathcal{I}}()}}}}} & (18)\end{matrix}$

We solve this within the iSAM2 framework [12], which is able to providea near-optimal solution with real-time performance.

V. Experiments

We implemented our algorithm in C++ using GTSAM [39] and its iSAM2implementation as the optimization back-end. All experiments were ableto be computed in real-time.

The front-end in our implementation simply selects every 15th cameraframe as a keyframe. As mentioned in section IIIB, the trackingfront-end extracts ORB features [33] from every selected keyframe andtracks them forward through the images by matching the ORB descriptors.Outlier tracks are eliminated by estimating the essential matrix betweenthe two views using RANSAC and removing those features which do not fitthe estimated model. We assume that the timeframe between two subsequentimages is short enough that the orientation difference between the twoframes can be estimated accurately by integrating the gyroscopemeasurements. Thus, only the unit translation vector between the twoimages needs to be estimated. We can then estimate the essential matrixusing only two point correspondences [41].

The front-end's object detector is an implementation of the deformableparts model detection algorithm [38]. On the acquisition of the semanticmeasurements from a new keyframe, the Mahalanobis distance from themeasurement to all known landmarks is computed. If all such distancesare above a certain threshold, a new landmark is initialized in the map,with initial position estimate along the camera ray, with depth given bythe median depth of all geometric feature measurements within itsdetected bounding box (or some fixed value if no such features weretracked successfully).

While ideally we would iterate between solving for constraint weightsw_(ij) and poses as proposition 2 suggests, in practice forcomputational reasons we solve for the weights just once per keyframe.

Our experimental platform was a VI-Sensor [42] from which we used theIMU and left camera. We performed three separate experiments. The firstconsists of a medium length (approx. 175 meters) trajectory around onefloor of an office building, in which the object classes detected andkept in the map were two types of chairs (red office chairs and brownfour-legged chairs). The second experiment is a long (approx. 625meters) trajectory around two different floors of an office building.The classes in the second experiment are red office chairs and doors.The third and final trajectory is several loops around a room equippedwith a vicon motion tracking system, in which the only class of objectsdetected is red office chairs. In addition to our own experiments, weapplied our algorithm to the KITTI dataset [43] odometry sequences 05and 06.

The final trajectory estimate along with the estimated semantic map forthe first office experiment is shown in FIG. 4. The trajectoriesestimated by our algorithm, by the ROVIO visual-inertial odometryalgorithm [31], and by the ORB-SLAM2 visual SLAM algorithm [44], [45],projected into the x-y plane, are shown in FIG. 5. Due to a lack ofinertial information and a relative lack of visual features in theenvironment, ORB-SLAM2 frequently got lost and much of the trajectoryestimate is missing, but was always able to recover when entering apreviously mapped region.

The second office experiment trajectory along with the estimated map isshown in FIG. 3. An example image overlaid with object detections fromnear the beginning of this trajectory is displayed in FIG. 2. Weconstructed a partial map of the top floor in the experiment using aground robot equipped with a lidar scanner. On this ground truth map, wemanually picked out door locations. The portion of the estimatedtrajectory on the top floor is overlayed onto this partial truth map(the two were manually aligned) in FIG. 6. Due to the extremelyrepetitive nature of the hallways in this experiment, bag-of-words basedloop closure detections are subject to false positives and incorrectmatches. ORBSLAM2 was unable to successfully estimate the trajectory dueto such false loop closures. A partial trajectory estimate after anincorrect loop closure detection is shown in FIG. 7.

The vicon trajectory and the estimated map of chairs is shown in FIG. 8.We evaluated the position error with respect to the vicon's estimate forour algorithm, ROVIO, and ORBSLAM2 and the results are shown in FIG. 9.Note that the spikes in the estimate errors are due to momentaryocclusion from the vicon cameras.

We also evaluated our algorithm on the KITTI outdoor dataset, usingodometry sequences 05 and 06. The semantic objects detected and used inour algorithm were cars. Rather than use inertial odometry in thisexperiment, we used the VISO2 [46] visual odometry algorithm as theinitial guess X⁽⁰⁾ for a new keyframe state. Similarly, we replaced thepreintegrated inertial relative pose (cf. Sec. IV-B.3) with the relativepose obtained from VISO in the odometry factors. The absolute positionerrors over time for KITTI sequence 05 with respect to ground truth forour algorithm, VISO2, and ORB-SLAM2 with monocular and stereo camerasare shown in FIG. 10. The same for sequence 06 are shown in FIG. 11.Finally, the mean translational and rotational errors over all possiblesubpaths of length (100, 200, . . . , 800) meters are shown in FIG. 12.

VI. Conclusion

The experiments demonstrated that in complex and cluttered real-worlddatasets our method can be used to reconstruct the full 6-D pose historyof the sensor and the positions and classes of the objects contained inthe environment. The advantage of our work is that by having semanticfeatures directly into the optimization, we include a relatively sparseand easily distinguishable set of features that allows for improvedlocalization performance and loop closure, while only slightly impactingthe computational cost of the algorithm. Furthermore, semanticinformation about the environment is valuable in and of itself in aidingautonomous operation of robots within a human-centric environment.

In future work, we plan to expand our algorithm to estimate the fullpose of the semantic objects (i.e., orientation in addition toposition). We also plan to fully exploit our EM decomposition byreconsidering data associations for past keyframes, and to considersystems with multiple sensors and non-stationary objects.

Appendix I: Proof of Proposition 1

First, we rewrite the optimization in (4) without a logarithm andsimilarly expand the expectation:

$^{i + 1},{\mathcal{L}^{i + 1} = {\underset{,\mathcal{L}}{\arg \mspace{11mu} \max}{\sum\limits_{ \in }{{p\left( {\left.  \middle| ^{i} \right.,\mathcal{L}^{i},} \right)}{p\left( {\left.  \middle|  \right.,\mathcal{L},} \right)}}}}}$

The data association likelihood can then be rewrite as

$\begin{matrix}{{p\left( {{^{i}},\mathcal{L}^{i},} \right)} = \frac{{p\left( {\left.  \middle| ^{i} \right.,\mathcal{L}^{i},} \right)}{p\left( {\left.  \middle| ^{i} \right.,\mathcal{L}^{i}} \right)}}{\sum_{}{{p\left( {\left.  \middle| ^{i} \right.,\mathcal{L}^{i},} \right)}{p\left( {\left.  \middle| ^{i} \right.,\mathcal{L}^{i}} \right)}}}} & {{~~~~~~~~~~~~~~~}(19)} \\{= \frac{p\left( {\left.  \middle| ^{i} \right.,\mathcal{L}^{i},} \right)}{\sum_{}\; {p\left( {\left.  \middle| ^{i} \right.,\mathcal{L}^{i},} \right)}}} & {(20)}\end{matrix}$

with the last equality due to the assumption that p(

|X,

) is uniform. We can next decompose the measurement likelihood p(

|X,

,

)=Π_(k)p(z_(k)|x_(α) _(k) ,

_(β) _(k) ), and so

$\begin{matrix}{{{^{i + 1},\mathcal{L}^{i + 1}}\quad}{\quad{= {{{\quad\quad}\underset{,\mathcal{L}}{\quad\arg \mspace{11mu} \max}{\sum\limits_{ \in }{{p\left( {\left.  \middle| ^{i} \right.,\mathcal{L}^{i},} \right)}p\left( {\left.  \middle|  \right.,\mathcal{L},D} \right)}}} = {\underset{,\mathcal{L}}{\arg \mspace{14mu} \max} {\sum\limits_{ \in }{\prod\limits_{k} \frac{{p\left( {{z_{k}x_{\alpha_{k}}^{i}},_{\beta_{k}}^{i}} \right)}{p\left( {{z_{k}x_{\alpha_{k}}},_{\beta_{k}}} \right)}}{\sum_{}{p\left( {{^{i}},\mathcal{L}^{i},} \right)}}}}}}}}} & (21)\end{matrix}$

The result then follows by noting that the normalizing denominator isindependent of the optimization variables and from the definition of thematrix permanent.

Appendix II: Proof of Proposition 2

Suppose we have some initial guess given by θ^((i))−{X^((i)),

^(p,(i))}. We can then compute an improved estimate of θ={X, l^(p)} bymaximizing the expected log likelihood:

$\begin{matrix}{\theta^{({i + 1})} = {\arg \mspace{14mu} {\max\limits_{\theta}{_{,{^{c}\theta^{(i)}}}\left\lbrack {\log \mspace{11mu} {p\left( {,^{c},,,{\mathcal{I}\theta}} \right)}} \right\rbrack}}}} & (22)\end{matrix}$

Expanding the expectation,

$\begin{matrix}{{_{,{^{c}\theta^{(i)}}}\left\lbrack {\log \mspace{14mu} {p\left( {,^{c},,,{\mathcal{I}\theta}} \right)}} \right\rbrack} =} & (23)\end{matrix}$

Letting k(

,

^(c))

p(

,

^(c)|S, θ^((i))), a constant with respect to the optimization variables,we continue:

$\begin{matrix}{{\lbrack \cdot \rbrack} = {{{\sum\limits_{,^{c}}{{\kappa \left( {,^{c}} \right)}\log \mspace{14mu} {p\left( {,,{^{c}\theta}} \right)}}} + {\sum\limits_{,^{c}}{{\kappa \left( {,^{c}} \right)}{\log \;\left\lbrack {{p\left( {\theta} \right)}{p\left( {\mathcal{I}\theta} \right)}} \right\rbrack}}}} = {{\sum\limits_{,^{c}}{{\kappa \left( {,^{c}} \right)}\log \mspace{14mu} {p\left( {,,{^{c}\theta}} \right)}}} + {\log \mspace{14mu} {p\left( {\theta} \right)}} + {\log \mspace{11mu} {p\left( {\mathcal{I}\theta} \right)}}}}} & (24)\end{matrix}$

Focusing on the leftmost summation over data associations and landmarkclasses,

$\begin{matrix}{{\sum\limits_{,^{c}}{{\kappa \left( {,^{c}} \right)}\log \mspace{14mu} {p\left( {,,{^{c}\theta}} \right)}}} = {{\sum\limits_{,^{c}}{{\kappa \left( {,^{c}} \right)}\log \mspace{14mu} {p\left( {{},^{c},\theta} \right)}}} + {\sum\limits_{,^{c}}{{\kappa \left( {,^{c}} \right)}\log \mspace{14mu} {p\left( {,{^{c}\theta}} \right)}}}}} & (25)\end{matrix}$

Using the assumption that p(

,

^(c)|θ) is a uniform distribution over the space of data associationsand landmark classes, this term doesn't affect which θ maximizes theobjective, so for optimization purposes we have

$\begin{matrix}{{\sum\limits_{,^{c}}{{\kappa \left( {,^{c}} \right)}\log \mspace{14mu} {p\left( {,,{^{c}\theta}} \right)}}} = {\sum\limits_{,^{c}}{{\kappa \left( {,^{c}} \right)}\log \mspace{14mu} {p\left( {{},{^{c}\theta}} \right)}}}} & {{~~~~~~~~~~~}(26)} \\{= {\sum\limits_{t}{\sum\limits_{i}\sum\limits_{_{t},^{c}}}}} & {(27)} \\{{{\kappa \left( {_{t},^{c}} \right)}\log \mspace{14mu} {p\left( {{s_{i}x_{t}},_{\beta_{i}}} \right)}}} & \end{matrix}$

Note that if we let

(i, j) be the subset of all possible data associations that assignmeasurement i to landmark j, we can further decompose this summation as

$\begin{matrix}{{\sum\limits_{,^{c}}{{\kappa \left( {,^{c}} \right)}\log \mspace{14mu} {p\left( {,,{^{c}\theta}} \right)}}} = {\sum\limits_{t}{\sum\limits_{i}{\sum\limits_{j}{\sum\limits_{^{c}}{\sum\limits_{_{t} \in {{({i,j})}}}{{\kappa \left( {_{t},^{c}} \right)}\log \mspace{14mu} {p\left( {{s_{i}x_{t}},_{j}} \right)}}}}}}}} & (28)\end{matrix}$

Finally, letting

$w_{ij}^{t}\overset{\Delta}{=}{\sum\limits_{^{c}}{\sum\limits_{_{t} \in {{({i,j})}}}{\kappa \left( {_{t},^{c}} \right)}}}$

we can write the final expectation maximization as

$\begin{matrix}{\theta^{({i + 1})} = {{\arg \mspace{14mu} {\max\limits_{\theta}{\sum\limits_{t}{\sum\limits_{i}{\sum\limits_{j}{w_{ij}^{t}\log \mspace{11mu} {p\left( {{s_{i}x_{t}},_{j}} \right)}}}}}}} + {\log \mspace{11mu} {p\left( {\theta} \right)}} + {\log \mspace{14mu} {p\left( {\mathcal{I}\theta} \right)}}}} & (29)\end{matrix}$

REFERENCES

Each of the following references is hereby incorporated by reference inits entirety.

-   [1] J. A. Hesch, D. G. Kottas, S. L. Bowman, and S. I. Roumeliotis,    “Consistency Analysis and Improvement of Vision-aided Inertial    Navigation,” IEEE Trans. on Robotics (TRO), vol. 30, no. 1, pp.    158-176, 2014.-   [2] D. G. Kottas and S. I. Roumeliotis, “Efficient and Consistent    Visionaided Inertial Navigation using Line Observations,” in IEEE    Int. Conf. on Robotics and Automation (ICRA), 2013, pp. 1540-1547.-   [3] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, “RGB-D    mapping: Using Kinect-style depth cameras for dense 3D modeling of    indoor environments,” The International Journal of Robotics Research    (IJRR), vol. 31, no. 5, pp. 647-663, 2012.-   [4] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan,    “Object Detection with Discriminatively Trained Part-Based Models,”    IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI),    vol. 32, no. 9, pp. 1627-1645, 2010.-   [5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards    real-time object detection with region proposal networks,” in    Advances in Neural Information Processing Systems (NIPS), 2015.-   [6] P. Agrawal, R. Girshick, and J. Malik, “Analyzing the    performance of multilayer neural networks for object recognition,”    in Computer Vision—ECCV 2014. Springer, 2014, pp. 329-344.-   [7] X. Liu, Y. Zhao, and S.-C. Zhu, “Single-view 3d scene parsing by    attributed grammar,” in Computer Vision and Pattern Recognition    (CVPR), 2014 IEEE Conference on. IEEE, 2014, pp. 684-691.-   [8] X. Chen, K. Kundu, Y. Zhu, A. Berneshawi, H. Ma, S. Fidler,    and R. Urtasun, “3d object proposals for accurate object class    detection,” in NIPS, 2015.-   [9] H. Durrant-Whyte and T. Bailey, “Simultaneous localization and    mapping: part i,” Robotics Automation Magazine, IEEE, vol. 13, no.    2, pp. 99-110, June 2006.-   [10] F. Lu and E. Milios, “Globally Consistent Range Scan Alignment    for Environment Mapping,” Auton. Robots, vol. 4, no. 4, pp. 333-349,    1997.-   [11] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W.    Burgard, “g2o: A General Framework for Graph Optimization,” in IEEE    Int. Conf. on Robotics and Automation (ICRA), 2011, pp. 3607-3613.-   [12] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. Leonard, and F.    Dellaert, “iSAM2: Incremental Smoothing and Mapping Using the Bayes    Tree,” The International Journal of Robotics Research (IJRR), vol.    31, no. 2, pp. 216-235, 2012.-   [13] C. Galindo, A. Saffiotti, S. Coradeschi, P. Buschka, J.    Fernandez-Madrigal, and J. Gonzalez, “Multi-hierarchical Semantic    Maps for Mobile Robotics,” in IEEE/RSJ Int. Conf. on Intelligent    Robots and Systems (IROS), 2005, pp. 2278-2283.-   [14] J. Civera, D. Galvez-Lopez, L. Riazuelo, J. Tardos, and J.    Montiel, “Towards Semantic SLAM Using a Monocular Camera,” in    IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2011, pp.    1277-1284.-   [15] A. Pronobis, “Semantic Mapping with Mobile Robots,”    dissertation, KTH Royal Institute of Technology, 2011.-   [16] J. Stückler, B. Waldvogel, H. Schulz, and S. Behnke, “Dense    real-time mapping of object-class semantics from RGB-D video,”    Journal of Real-Time Image Processing, pp. 1-11, 2013.-   [17] V. Vineet, O. Miksik, M. Lidegaard, M. Niesner, S.    Golodetz, V. A. Prisacariu, O. Kähler, D. W. Murray, S. Izadi, P.    Perez, and P. H. S. Torr, “Incremental dense semantic stereo fusion    for large-scale semantic scene reconstruction,” in IEEE    International Conference on Robotics and Automation (ICRA), 2015.-   [18] B. Leibe, N. Cornelis, K. Cornelis, and L. Van Gool, “Dynamic    3d scene analysis from a moving vehicle,” in IEEE Conf. on Computer    Vision and Pattern Recognition (CVPR), June 2007, pp. 1-8.-   [19] S. Pillai and J. Leonard, “Monocular slam supported object    recognition,” in Proceedings of Robotics: Science and Systems (RSS),    Rome, Italy, July 2015.-   [20] N. Atanasov, M. Zhu, K. Daniilidis, and G. Pappas, “Semantic    Localization Via the Matrix Permanent,” in Robotics: Science and    Systems (RSS), 2014.-   [21] S. Bao and S. Savarese, “Semantic Structure from Motion,” in    IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2011,    pp. 2025-2032.-   [22] R. Salas-Moreno, R. Newcombe, H. Strasdat, P. Kelly, and A.    Davison, “SLAM++: Simultaneous Localisation and Mapping at the Level    of Objects,” in IEEE Conf. on Computer Vision and Pattern    Recognition (CVPR), 2013, pp. 1352-1359.-   [23] D. Gálvez-López, M. Salas, J. Tardós, and J. Montiel,    “Real-time Monocular Object SLAM,” arXiv:1504.02398, 2015.-   [24] I. Reid, “Towards Semantic Visual SLAM,” in Int. Conf. on    Control Automation Robotics Vision (ICARCV), 2014.-   [25] A. Kundu, Y. Li, F. Dellaert, F. Li, and J. Rehg, “Joint    semantic segmentation and 3d reconstruction from monocular video,”    in Computer Vision ECCV 2014, ser. Lecture Notes in Computer    Science, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds.    Springer International Publishing, 2014, vol. 8694, pp. 703-718.    [Online]. Available: http://dx.doi.org/10.1007/978-3-319-10599-445-   [26] I. Kostavelis and A. Gasteratos, “Semantic mapping for mobile    robotics tasks: A survey,” Robotics and Autonomous Systems, vol. 66,    pp. 86-103, 2015.-   [27] J. Neira and J. Tardós, “Data Association in Stochastic Mapping    Using the Joint Compatibility Test,” IEEE Trans. on Robotics and    Automation (TRO), vol. 17, no. 6, pp. 890-897, 2001.-   [28] J. Munkres, “Algorithms for the Assignment and Transportation    Problems,” Journal of the Society for Industrial & Applied    Mathematics (SIAM), vol. 5, no. 1, pp. 32-38, 1957.-   [29] N. Atanasov, M. Zhu, K. Daniilidis, and G. Pappas,    “Localization from semantic observations via the matrix permanent,”    The International Journal of Robotics Research, vol. 35, no. 1-3,    pp. 73-99, 2016.-   [30] A. Mourikis and S. I. Roumeliotis, “A multi-state constraint    kalman filter for vision-aided inertial navigation,” in Robotics and    Automation, 2007 IEEE International Conference on. IEEE, 2007, pp.    3565-3572.-   [31] M. Bloesch, S. Oman, M. Nutter, and R. Siegwart, “Robust visual    inertial odometry using a direct ekf-based approach,” in IEEE/RSJ    Int. Conf. on Intelligent Robots and Systems (IROS), 2015, pp.    298-304.-   [32] C. Forster, M. Pizzoli, and D. Scaramuzza, “Svo: Fast    semi-direct monocular visual odometry,” in IEEE Int. Conf. on    Robotics and Automation (ICRA), 2014, pp. 15-22.-   [33] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An    efficient alternative to sift or surf,” in Int. Conf. on Computer    Vision, 2011, pp. 2564-2571.-   [34] S. Gidaris and N. Komodakis, “Object detection via a    multi-region and semantic segmentation-aware cnn model,” in IEEE    Int. Conf. on Computer Vision, 2015, pp. 1134-1142.-   [35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning    for image recognition,” arXiv preprint arXiv:1512.03385, 2015.-   [36] Z. Cai, Q. Fan, R. Feris, and N. Vasconcelos, “A unified    multi-scale deep convolutional neural network for fast object    detection,” in European Conference on Computer Vision (ECCV), 2016.-   [37] M. Zhu, N. Atanasov, G. Pappas, and K. Daniilidis, “Active    Deformable Part Models Inference,” in European Conference on    Computer Vision (ECCV), ser. Lecture Notes in Computer Science.    Springer, 2014, vol. 8695, pp. 281-296.-   [38] C. Dubout and F. Fleuret, “Deformable part models with    individual part scaling,” in British Machine Vision Conference, no.    EPFL-CONF-192393, 2013.-   [39] F. Dellaert, “Factor graphs and gtsam: A hands-on    introduction,” GT RIM, Tech. Rep. GT-RIM-CP&R-2012-002,    September 2012. [Online]. Available:    https://research.cc.gatech.edu/borg/sites/edu.borg/files/downloads/gtsam.pdf-   [40] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “Imu    preintegration on manifold for efficient visual-inertial    maximum-a-posteriori estimation,” in Proceedings of Robotics:    Science and Systems, Rome, Italy, July 2015.-   [41] D. G. Kottas, K. Wu, and S. I. Roumeliotis, “Detecting and    dealing with hovering maneuvers in vision-aided inertial navigation    systems,” in 2013 IEEE/RSJ International Conference on Intelligent    Robots and Systems, November 2013, pp. 3172-3179.-   [42] J. Nikolic, J. Rehder, M. Burn, P. Gohl, S. Leutenegger, P. T.    Furgale, and R. Siegwart, “A synchronized visual-inertial sensor    system with fpga pre-processing for accurate real-time slam,” in    Robotics and Automation (ICRA), 2014 IEEE International Conference    on. IEEE, 2014, pp. 431-437.-   [43] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for    autonomous driving? the kitti vision benchmark suite,” in Conference    on Computer Vision and Pattern Recognition (CVPR), 2012.-   [44] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM: a    versatile and accurate monocular SLAM system,” IEEE Transactions on    Robotics, vol. 31, no. 5, pp. 1147-1163, 2015.-   [45] R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: an open-source SLAM    system for monocular, stereo and RGB-D cameras,” CoRR, vol.    abs/1610.06475, 2016. [Online]. Available:    http://arxiv.org/abs/1610.06475-   [46] A. Geiger, J. Ziegler, and C. Stiller, “StereoScan: Dense 3d    Reconstruction in Real-time,” in Intelligent Vehicles Symposium    (IV), 2011, pp. 963-968.

Probabilistic Data Association for Semantic SLAM—Section II I.Probabilistic Data Association in SLAM

Consider the classical localization and mapping problem, in which amobile sensor moves through an unknown environment, modeled as acollection

{

_(m)}_(m=1) ^(M) of M static landmarks. Given a set of sensormeasurements

{z_(k)}_(k=1) ^(K), the task is to estimate the landmark positions

and a sequence of poses X

{x_(t)}_(t=1) ^(T) representing the sensor trajectory. Most existingwork focuses on estimating X and L and rarely emphasizes that the dataassociation

{(α_(k), β_(k))}_(i=1) ^(K) stipulating that measurement Z_(k) oflandmark

_(Bk) was obtained from sensor state x_(α) _(k) is in fact unknown. Acomplete statement of the SLAM problem involves maximum likelihoodestimation of X,

and

given the measurements

:

$\begin{matrix}{\hat{},\hat{\mathcal{L}},{\hat{} = {\underset{,\mathcal{L},}{\arg \mspace{14mu} \max}\mspace{11mu} \log \mspace{14mu} {p\left( {\left.  \middle|  \right.,\mathcal{L},} \right)}}}} & (1)\end{matrix}$

The most common approach to this maximization has been to decompose itinto two separate estimation problems. First, given prior estimates X⁰and

⁰, the maximum likelihood estimate

of the data association

is computed (e.g., via JCBB [1] or the Hungarian algorithm [2]). Then,given

, the most likely landmark and sensor states are estimated

$\begin{matrix}{{\hat{} = {\underset{}{\arg \mspace{14mu} \max}\mspace{14mu} {p\left( {\left.  \middle| ^{0} \right.,\mathcal{L}^{0},} \right)}}}{\hat{},{\hat{\mathcal{L}} = {\underset{,\mathcal{L}}{\arg \mspace{14mu} \max}\mspace{14mu} \log \mspace{14mu} {p\left( {\left.  \middle|  \right.,\mathcal{L},\hat{}} \right)}}}}} & (2)\end{matrix}$

The second optimization above is typically carried out via filtering[4]-[6] or pose-graph optimization [7], [8].

The above process has the disadvantage that an incorrectly chosen dataassociation may have a highly detrimental effect on the estimationperformance. Moreover, if ambiguous measurements are discarded to avoidincorrect association choices, they will never be reconsidered laterwhen refined estimates of the sensor pose (and hence their dataassociation) are available. Instead of a simple one step process, then,it is possible to perform coordinate descent, which iterates the twomaximization steps as follows:

$\begin{matrix}{{^{i + 1} = {\underset{}{\arg \mspace{14mu} \max}\mspace{14mu} {p\left( {\left.  \middle| ^{i} \right.,\mathcal{L}^{i},} \right)}}}{^{i + 1},{\mathcal{L}^{i + 1} = {\underset{,\mathcal{L}}{\arg \mspace{14mu} \max}\mspace{14mu} \log \mspace{14mu} {p\left( {\left.  \middle|  \right.,\mathcal{L},^{i + 1}} \right)}}}}} & (3)\end{matrix}$

This resolves the problem of being able to revisit association decisionsonce state estimates improve but does little to resolve the problem withambiguous measurements since a hard decision on data associations isstill required. To address this, rather than simply selecting

as the mode p(

|X,

,

), we should consider the entire density of

when estimating X and

. Given initial estimates X^(i),

^(i), an improved estimate that utilizes the whole density of

can be computed by maximizing the expected measurement likelihood viaexpectation maximization (EM):

$\begin{matrix}{\begin{matrix}{^{i + 1},{\mathcal{L}^{i + 1} = {\underset{,\mathcal{L}}{\arg \mspace{14mu} \max}\mspace{14mu} {_{}\left\lbrack {\left. {\log \mspace{14mu} {p\left( {\left.  \middle|  \right.,\mathcal{L},} \right)}} \middle| ^{i} \right.,\mathcal{L}^{i},} \right\rbrack}}}} \\{= {\underset{,\mathcal{L}}{\arg \mspace{14mu} \max}{\sum\limits_{ \in }{{p\left( {\left.  \middle| ^{i} \right.,\mathcal{L}^{i},} \right)}\log \mspace{14mu} {p\left( {\left.  \middle|  \right.,\mathcal{L},} \right)}}}}}\end{matrix}\quad} & (4)\end{matrix}$

where

is the space of all possible values of

. This EM formulation has the advantage that no hard decisions on dataassociation are required since it probabilistically “averages” over allpossible associations. To compare this with the coordinate descentformulation in (3), we can rewrite (4). We use the assumption here thatdata associations are independent across time steps, i.e. p(

|

, X,

)=Π_(t=1) ^(T)p(

_(t)|

_(t), X,

) Furthermore, let

_(t)={z_(k) ^(t)}_(k=1) ^(K) ^(t) ⊆

be the set of measurements received at time t.

$\begin{matrix}{{{\underset{,\mathcal{L}}{\arg \mspace{14mu} \max}{\sum\limits_{t = 1}^{T}{\sum\limits_{_{t} \in _{t}}{\sum\limits_{k = 1}^{K_{t}}{{p\left( {\left. _{t} \middle| X^{i} \right.,\mathcal{L}^{i},_{t}} \right)}\log \mspace{14mu} {p\left( {\left. z_{k} \middle| x_{t} \right.,_{\beta_{k}}} \right)}}}}}} = {\underset{,\mathcal{L}}{\arg \mspace{14mu} \max}{\sum\limits_{t = 1}^{T}{\overset{K_{t}}{\sum\limits_{k = 1}}{\sum\limits_{j = 1}^{M}{w_{kj}^{t,i}\log \mspace{14mu} {p\left( {\left. z_{k}^{t} \middle| x_{t} \right.,_{j}} \right)}}}}}}}\mspace{20mu} {where}} & (5) \\{\mspace{79mu} {w_{kj}^{t,i}\overset{\Delta}{=}{\sum\limits_{_{t} \in {_{t}{({k,j})}}}{p\left( {\left. _{t} \middle| ^{i} \right.,\mathcal{L}^{i},_{t}} \right)}}}} & (6)\end{matrix}$

a weight, independent of the optimization variables X and

, that quantifies the influence of the “soft” data association,

_(t)⊆

is the set of all possible data associations for measurements receivedat time t, and

_(t)(k, j)

{

_(t) ∈

_(t)|β_(k)=j}⊆

_(t) is the set of all data associations such that measurement k isassigned to landmark j. Note that the coordinate descent optimization(3) has a similar form to (8), except that for each k there is exactlyone j such that w_(kj) ^(i)=1 and w_(kl) ^(i)=0 for all l≠j.

Because of the combinatorial size of the data association space

, for a large number of measurements

the weights w_(kj) ^(t,i) can become intractible to compute. It ispossible, however, to express this quantity as a matrix permanent.

Proposition 1. Optimization (4) can be solved iteratively as follows:

$\begin{matrix}{{w_{kj}^{t,i} = {\gamma_{t}^{i}l_{kj}^{t}\mspace{14mu} {per}\mspace{14mu} L_{- {kj}}^{t}}}{^{i + 1},{\mathcal{L}^{i + 1} = {\underset{,\mathcal{L}}{\arg \mspace{14mu} \max}{\sum\limits_{t = 1}^{T}{\overset{K_{t}}{\sum\limits_{k = 1}}{\sum\limits_{j = 1}^{M}{w_{kj}^{t,i}\log \mspace{14mu} {p\left( {\left. z_{k}^{t} \middle| x_{t} \right.,_{j}} \right)}}}}}}}}} & (8)\end{matrix}$

where γ_(t) ^(i) is a normalizing factor such that Σ_(k)w_(kj) ^(t,i)=1,per denotes the matrix permanent, L^(t) is the matrix of individualmeasurement likelihoods with l_(ij) ^(t)=p(z_(i) ^(t)|x_(t),

_(j)) and L_(−ij) ^(t) the matrix L^(t) with the ith row and jth columnremoved.Proof. See Appendix A.

Crucially, the above proposition allows us to take advantage of matrixpermanent approximation algorithms [9], [10] that have been developed.Proposition 1 thus allows us to effectively summarize thecombinatorially large data association space in polynomial time, makingprobabilistic data association feasible for even a large number ofmeasurements.

Similar to the coordinate descent formulation, the EM formulation (8)allows us to solve the permanent maximization problem iteratively.First, instead of estimating a maximum likelihood data association, weestimate the data association distribution p(

|X^(i),

^(i),

) in the form of the weights w_(kj) ^(i) (the “E” step). Then, wemaximize the expected measurement log likelihood over the previouslycomputed distribution (the “M” step).

While this approach allows us to include measurements of landmarks inthe map without computing an explicit data association, the question ofhow to include initialization of new landmarks remains. Past approachesmake a hard decision based on e.g. the Mahalanobis distance between themeasurement and the expected measurement from all known landmarks. Ifthe smallest such distance is above a threshold, a new landmark isinitialized, and the optimization proceeds as described above.

The “soft” data association formulated above, however, is generalizableto richer data association spaces and models, including that of newlandmark initialization. Previously, we had considered the dataassociation space

as the set of one-to-one mappings from the measurements {1, 2, . . . , |

|} onto known landmarks {1, . . . , M}. Here, we expand the codomain offunctions in

from {1, . . . , M}. We now let a measurement z_(k) map to an existinglandmark in {1, . . . , M}, or to a special “new” association n_(k)corresponding to measurement k being of a previously unseen landmark.

Let S_(d)(t) be the set of landmarks that generated a measurement attime t, the “detected” set, and let S_(n)(t) be the set of landmarksthat were newly observed at time t. Let p_(tj) be the probability that agiven landmark

_(j) is detected from sensor state x_(t). For a fixed x,

, we assume that the probability that

is detected from x is given by a known function p_(D)(x,

). As our estimates are uncertain, however, we must take this intoaccount in the computation of p_(tj):

p _(tj)=∫_(x)

p _(D)(x,

)p(x,

)d

dx   (9)

In practice, this quantity can be estimated accurately via Monte Carlointegration with relatively few samples by drawing from the jointestimated distribution p(x,

):

$\begin{matrix}{{\hat{p}}_{tj} = {\frac{1}{N}{\sum\limits_{i = 1}^{N}{p_{D}\left( {x_{i},_{i}} \right)}}}} & (10)\end{matrix}$

where (x_(i),

_(i))_(i=1) ^(N) are N i.i.d. samples drawn from current estimateddistribution of x,

p(x,

).

As in [11], we model the “birth” of new landmarks at time step t as aPoisson random finite set S_(n)(t). The cardinality of S_(n)(t) andhence the number of new landmarks observed at any time step t is thusPoisson distributed with mean λ_(n). For a new landmark

∈ S_(n)(t) that generates measurement z, we assume that the measurementz is spatially distributed with distribution p_(new)(z) which can betaken to be e.g. uniform over the measurement space.

Let

_(n) _(k) be a possibly-existing landmark that is newly initialized frommeasurement k. To simplify notation and exposition, we let

_(M+1)=

_(n) ₁ ,

_(M+2)=

_(n) ₂ , . . . ,

_(M+K)=

_(n) _(K) . We then have the following result.

Proposition 2. Optimization (4) under the expanded data associationspace can be solved as follows:

$\begin{matrix}{{w_{kj}^{i} = {\gamma_{t}^{i}q_{kj}^{t}\mspace{14mu} {per}\mspace{14mu} Q_{- {kj}}^{t}}}{^{i + 1},{\mathcal{L}^{i + 1} = {\underset{,\mathcal{L}}{\arg \mspace{14mu} \max}{\sum\limits_{t = 1}^{T}{\overset{K_{t}}{\sum\limits_{k = 1}}{\sum\limits_{j = 1}^{M + K_{t}}{w_{kj}^{t,i}\log \mspace{14mu} {p\left( {\left. z_{k}^{t} \middle| x_{t} \right.,_{j}} \right)}}}}}}}}} & (12)\end{matrix}$

where γ_(t) ^(i) is a normalizing factor such that Σ_(k) w_(kj) ^(i)=1.

Q^(t)=[G^(t) H^(t)]  (13)

G^(t) ∈

^(K) ^(t) ^(×|M) ^(d) ^((t)|) is a matrix of scaled measurementlikelihoods with

$\begin{matrix}{g_{kj}^{t} = {\frac{p_{tj}}{\lambda_{n}\left( {1 - p_{tj}} \right)}{p\left( {\left. z_{k} \middle| x_{t} \right.,_{j}} \right)}}} & (14)\end{matrix}$

H^(t) ∈

^(K) ^(t) ^(×K) ^(t) is a diagonal matrix of birthed landmarklikelihoods:

H

diag([p_(new)(z₁) . . . p_(new)(z_(K))])   (15)

and Q_(−kj) is the matrix Q with the kth row and jth column removed.Proof. See Appendix B

Intuitively, this approach can be viewed as follows. For eachmeasurement z_(k), a new landmark

_(n) _(k) is initialized from the measurement. The data associationdistribution weights w are then computed over both landmarks previouslyin the map

_(j) and the newly initialized landmarks

_(n) _(k) while accounting for the probabilities of detecting knownlandmarks and of observing new ones.

Appendix A Proof of Proposition 1

The weights w_(kj) ^(t,i) are defined as (see (6))

$\begin{matrix}{w_{kj}^{t,i}\overset{\Delta}{=}{\sum\limits_{_{t} \in {_{t}{({k,j})}}}{{p\left( {\left. _{t} \middle| ^{i} \right.,\mathcal{L}^{i},_{t}} \right)}.}}} & (16)\end{matrix}$

We can write an individual data association probability as

$\begin{matrix}{\begin{matrix}{{p\left( {\left.  \middle|  \right.,\mathcal{L},} \right)} = \frac{{p\left( {\left.  \middle|  \right.,\mathcal{L},} \right)}{p\left( {\left.  \middle|  \right.,\mathcal{L}} \right)}}{p\left( {\left.  \middle| X \right.,\mathcal{L}} \right)}} \\{= \frac{{p\left( {\left.  \middle|  \right.,\mathcal{L},} \right)}{p\left( {\left.  \middle|  \right.,\mathcal{L}} \right)}}{\sum_{^{\prime}}{{p\left( {\left.  \middle|  \right.,\mathcal{L},^{\prime}} \right)}{p\left( {\left. ^{\prime} \middle|  \right.,\mathcal{L}} \right)}}}}\end{matrix}\quad} & (18)\end{matrix}$

Assuming that the likelihood of any given data association not given anymeasurements, p(

|X,

) is uniform, let γ be a constant proportionality factor

$\begin{matrix}{{\gamma_{t}^{i}\overset{\Delta}{=}\frac{1}{\sum_{_{t}^{\prime}}{p\left( {\left. _{t} \middle| X^{i} \right.,\mathcal{L}^{i},_{t}^{\prime}} \right)}}}{{so}\mspace{14mu} {that}}} & (19) \\{w_{kj}^{t,i} = {\gamma_{t}^{i}{\sum\limits_{_{t} \in {_{t}{({k,j})}}}{{p\left( {\left. _{t} \middle|  \right.,\mathcal{L},_{t}} \right)}.}}}} & (20)\end{matrix}$

Now, given a data association, individual measurements are independentand so we can expand

$\begin{matrix}{{w_{kj}^{t,i} = {\gamma_{t}^{i}{\sum\limits_{_{t} \in {_{t}{({k,j})}}}{\prod\limits_{s = 1}^{K_{t}}\; {p\left( {\left. z_{s}^{t} \middle| x_{t} \right.,_{\beta_{s}}} \right)}}}}}{{{{{For}\mspace{14mu} {all}\mspace{14mu} _{t}} \in {{_{t}\left( {k,j} \right)}\mspace{14mu} {we}\mspace{14mu} {have}\mspace{14mu} \beta_{k}}} = {j\mspace{14mu} {by}\mspace{14mu} {definition}}},{so}}} & (21) \\{w_{kj}^{t,i} = {\gamma_{t}^{i}\mspace{11mu} {p\left( {\left. z_{k}^{t} \middle| x_{t} \right.,_{j}} \right)}{\sum\limits_{_{t} \in {_{t}{({k,j})}}}{\prod\limits_{s \neq k}\; {p\left( {\left. z_{s}^{t} \middle| x_{t} \right.,_{\beta_{s}}} \right)}}}}} & (22)\end{matrix}$

From the definition of the matrix permanent,

$\begin{matrix}{{{per}\mspace{14mu} L} = {\sum\limits_{\pi}{\prod\limits_{s = 1}^{K_{t}}\; l_{s,{\pi {(s)}}}^{t}}}} & (23)\end{matrix}$

where the first sum is over all one-to-one functions π: {1, . . . ,K_(t)}→{1, . . . , M}. This is exactly our definition of a valid dataassociation, so

$\begin{matrix}{\begin{matrix}{{{per}\mspace{14mu} L} = {\sum\limits_{ \in }{\prod\limits_{s = 1}^{K_{t}}\; l_{s,\beta_{s}}^{t}}}} \\{= {\sum\limits_{ \in }{\prod\limits_{s = 1}^{K_{t}}{p\left( {\left. z_{s}^{t} \middle| x_{t} \right.,_{\beta_{s}}} \right)}}}}\end{matrix}\quad} & \begin{matrix}(24) \\\; \\\; \\(25)\end{matrix}\end{matrix}$

Similarly, the permanent of L_(−kj) ^(t) will include a sum over allone-to-one functions π: {1, . . . , K_(t)}\{k}→{1, . . . , M}\{j}. It isnow easy to see that

$\begin{matrix}{{{per}\mspace{14mu} L_{- {kj}}^{t}} = {\sum\limits_{_{t} \in {_{t}{({k,j})}}}\; {\prod\limits_{s \neq k}\; {p\left( {\left. z_{s}^{t} \middle| x_{t} \right.,_{\beta_{s}}} \right)}}}} & (26)\end{matrix}$

and so we have the final expression

$\begin{matrix}{\begin{matrix}{w_{kj}^{t,i} = {\gamma_{t}^{i}{p\left( {\left. z_{k}^{t} \middle| x_{t} \right.,_{j}} \right)}{per}\mspace{14mu} L_{- {kj}}^{t}}} \\{= {\gamma_{t}^{i}l_{kj}^{t}{per}\mspace{14mu} L_{- {kj}}^{t}}}\end{matrix}\quad} & (28)\end{matrix}$

Appendix B Proof of Proposition 2

First, note that equations (8), (6) still hold under the expanded dataassociation definition. We can expand the likelihood for a single dataassociation p(

_(t)|X,

,

) as

$\begin{matrix}{{p\left( {\left.  \middle|  \right.,\mathcal{L},} \right)} = \frac{{p\left( {\left.  \middle|  \right.,\mathcal{L},} \right)}{p\left( {\left.  \middle|  \right.,\mathcal{L}} \right)}}{p\left( {\left.  \middle|  \right.,\mathcal{L}} \right)}} & (29)\end{matrix}$

Let

(

) be the event that the received measurements Z are assigned to the setof detected landmarks S_(d)(t) in the way specified by the dataassociation

. We can decompose the probability p(

|X,

) as follows:

$\begin{matrix}{{p\left( {\left.  \middle|  \right.,\mathcal{L}} \right)} = {{P\left( { \in {{S_{d}(t)}\mspace{14mu} {are}\mspace{14mu} {detected}}} \right)} \times {P\left( { \in {{_{1:M}\backslash {S_{d}(t)}}\mspace{14mu} {are}\mspace{14mu} {missed}}} \right)} \times {P\left( { \in {{S_{d}(t)}\mspace{14mu} {are}\mspace{14mu} {new}}} \right)} \times {P\left( {()} \right)}}} & (30)\end{matrix}$

With our assumed detection model, the first term can be computed as

$\begin{matrix}{{P\left( { \in {{S_{d}(t)}\mspace{14mu} {are}\mspace{14mu} {detected}}} \right)} = {\prod\limits_{ \in {S_{d}{(t)}}}\; {p_{D}\left( {x_{t},} \right)}}} & (31)\end{matrix}$

The second term is similarly computed as

$\begin{matrix}{{P\left( { \in {{_{1:M}\backslash S_{d}}\mspace{14mu} {are}\mspace{14mu} {missed}}} \right)} = {\prod\limits_{ \in {_{1:M}\backslash S_{d}}}\left( {1 - {p_{D}\left( {x,} \right)}} \right)}} & (32)\end{matrix}$

The third term follows from the assumption that newly detected landmarksfollow a Poisson random finite set:

$\begin{matrix}{{{P\left( { \in {{S_{d}(t)}\mspace{14mu} {are}\mspace{14mu} {new}}} \right)} = \frac{e^{- \lambda_{n}}\lambda_{n}^{n}}{n!}},} & (33)\end{matrix}$

where n=|S_(n)(t)|.

The last term is simply uniform over the space of possible assignments,or

$\begin{matrix}{\begin{matrix}{{P\left( {()} \right)} = \frac{1}{K_{t}P_{d}}} \\{= \frac{\left( {K_{t} - d} \right)!}{K_{t}!}}\end{matrix}{{{where}\mspace{14mu} d} = {{{S_{d}(t)}}.}}} & \begin{matrix}(34) \\\; \\(35) \\\;\end{matrix}\end{matrix}$

Combining these,

$\begin{matrix}{{p\left( {\left.  \middle|  \right.,\mathcal{L}} \right)} = {\frac{\lambda_{n}^{n}e^{- \lambda_{n}}}{K_{t}!}{\prod\limits_{ \in _{1:M}}{\left( {1 - {p_{D}\left( {x_{t},} \right)}} \right){\prod\limits_{ \in {S_{d}{(t)}}}\frac{p_{D}\left( {x_{t},} \right)}{1 - {p_{D}\left( {x_{t},} \right)}}}}}}} & (36)\end{matrix}$

Next, given a particular data association

, individual measurements are independent, and so we expand themeasurement likelihoods p(

|X,

,

). Letting z(

) be the measurement assigned to landmark

under

,

$\begin{matrix}\begin{matrix}{{p\left( {\left.  \middle|  \right.,\mathcal{L},} \right)} = {\prod\limits_{k}\; {p\left( {\left. z_{k} \middle| x \right.,_{\beta_{k}}} \right)}}} \\{= {\prod\limits_{ \in S_{d}}{{p\left( {\left. {z()} \middle| x \right.,} \right)}}}} \\{= {\prod\limits_{ \in S_{d}}{{p\left( {\left. {z()} \middle| x \right.,} \right)}}}}\end{matrix} & \begin{matrix}(37) \\\; \\(38) \\\; \\(39)\end{matrix}\end{matrix}$

The weights w_(kj) can then be computed as

$\begin{matrix}\begin{matrix}{w_{kj}^{i}\overset{\Delta}{=}{\sum\limits_{ \in {{({k,j})}}}{p\left( {\left.  \middle| ^{i} \right.,\mathcal{L}^{i},} \right)}}} & {(40)} \\{= {\sum\limits_{ \in {{({k,j})}}}\frac{{p\left( {,,\mathcal{L},} \right)}{p\left( {\left.  \middle|  \right.,\mathcal{L}} \right)}}{p\left( {\left.  \middle|  \right.,\mathcal{L}} \right)}}} & {(41)} \\{= {\gamma_{t}{\sum\limits_{ \in {{({k,j})}}}{\prod\limits_{ \in S_{d}}{\frac{{p_{D}\left( {x,} \right)}{p\left( {\left. {z()} \middle| x \right.,} \right)}}{\lambda_{n}\left( {1 - {p_{D}\left( {x,} \right)}} \right)}{\prod\limits_{ \in S_{n}}{p_{new}\left( {z()} \right)}}}}}}} & {(42)} \\{{= {\gamma_{t}{p\left( {\left. z_{k} \middle| x \right.,_{j}} \right)}{\sum\limits_{ \in {{({k,j})}}}{\prod\limits_{\underset{{z{()}} \neq z_{k}}{ \in S_{d}}}{\frac{{p_{D}\left( {x,} \right)}{p\left( {\left. {z()} \middle| x \right.,} \right)}}{\lambda_{n}\left( {1 - {p_{D}\left( {x,} \right)}} \right)}\prod\limits_{\underset{{z{()}} \neq z_{k}}{ \in S_{n}}}}}}}}} & {{~~~}(43)}\end{matrix} & \; \\{\mspace{79mu} {{where}\mspace{79mu} {{\gamma \; t} = {\frac{\lambda_{n}^{K}e^{- \lambda_{n}}}{{K!}{p\left( {\left.  \middle|  \right.,\mathcal{L}} \right)}}{\prod\limits_{ \in _{1:M}}{\left( {1 - {p_{D}\left( {x,} \right)}} \right)\mspace{230mu} (44)}}}}}} & \;\end{matrix}$

is a normalizing factor independent of k, j, and any particular dataassociation.

Let G ∈

^(K×d) be a matrix of scaled measurement likelihoods with entries

$\begin{matrix}{g_{kj}\overset{\Delta}{=}\frac{{p_{D}\left( {x,_{j}} \right)}{p\left( {\left. z_{k} \middle| x \right.,_{j}} \right)}}{\lambda_{n}\left( {1 - {p_{D}\left( {x,_{j}} \right)}} \right)}} & (45)\end{matrix}$

let H ∈

^(K×K) be a diagonal matrix of birthed landmark measurementprobabilities,

H

diag([p_(new)(z₁) . . . p_(new)(z_(K))])   (46)

and let Q ∈

^(K×(d+K)) be given by the concatenation of G and H:

Q

[G H]  (47)

The permanent of Q is defined as

$\begin{matrix}{\sum\limits_{\pi}{\prod\limits_{s = 1}^{K}\; q_{s,{\pi {(s)}}}}} & (48)\end{matrix}$

where the sum is over all one-to-one functions π: {1, . . . , K}→{1, . .. , d, n₁, . . . , n_(K)}. This function space exactly matches the spaceof valid data associations. Thus, the permanent of Q can be written as

$\begin{matrix}\begin{matrix}{{{{per}\mspace{14mu} Q} = {\sum\limits_{ \in }{\prod\limits_{s = 1}^{K}\; q_{s}}}},\beta_{s}} & {(49)} \\{{= {\sum\limits_{ \in }{\prod\limits_{s:{\beta_{s} \leq K}}\; q_{s}}}},{\beta_{s}{\prod\limits_{s:{\beta_{s} > K}}\; q_{s}}},\beta_{s}} & {(50)} \\{{= {\sum\limits_{ \in }{\prod\limits_{ \in S_{d}}{\frac{{p_{D}\left( {x,_{j}} \right)}{p\left( {\left. {z()} \middle| x \right.,_{j}} \right)}}{\lambda_{n}\left( {1 - {p_{D}\left( {x,_{j}} \right)}} \right)}{\prod\limits_{ \in S_{n}}{p_{new}\left( {z()} \right)}}}}}}\mspace{14mu}} & {{~~~~~~~~~}(51)}\end{matrix} & \;\end{matrix}$

It is now easy to see that

$\begin{matrix}{{{per}\mspace{14mu} Q_{- {kj}}} = {\sum\limits_{ \in {{({k,j})}}}{\prod\limits_{\underset{{z{()}} \neq z_{k}}{ \in S_{d}}}{\frac{{p_{D}\left( {x,} \right)}{p\left( {\left. {z()} \middle| x \right.,} \right)}}{\lambda_{n}\left( {1 - {p_{D}\left( {x,} \right)}} \right)}{\prod\limits_{\underset{{z{()}} \neq z_{k}}{ \in S_{n}}}{p_{new}\left( {z()} \right)}}}}}} & (52)\end{matrix}$

and so we have that

w _(kj) ^(i) =γq _(kj) per Q _(−kj)   (53)

as desired.

REFERENCES

Each of the following references is hereby incorporated by reference inits entirety.

-   [1] J. Neira and J. Tardós, “Data Association in Stochastic Mapping    Using the Joint Compatibility Test,” IEEE Trans. on Robotics and    Automation (TRO), vol. 17, no. 6, pp. 890-897, 2001.-   [2] J. Munkres, “Algorithms for the Assignment and Transportation    Problems,” Journal of the Society for Industrial & Applied    Mathematics (SIAM), vol. 5, no. 1, pp. 32-38, 1957.-   [3] N. Atanasov, M. Zhu, K. Daniilidis, and G. Pappas, “Localization    from semantic observations via the matrix permanent,” The    International Journal of Robotics Research, vol. 35, no. 1-3, pp.    73-99, 2016.-   [4] A. Mourikis and S. Roumeliotis, “A multi-state constraint kalman    filter for vision-aided inertial navigation,” in Robotics and    Automation, 2007 IEEE International Conference on. IEEE, 2007, pp.    3565-3572.-   [5] M. Bloesch, S. Oman, M. Nutter, and R. Siegwart, “Robust visual    inertial odometry using a direct ekf-based approach,” in IEEE/RSJ    Int. Conf. on Intelligent Robots and Systems (IROS), 2015, pp.    298-304.-   [6] C. Forster, M. Pizzoli, and D. Scaramuzza, “Svo: Fast    semi-direct monocular visual odometry,” in IEEE Int. Conf. on    Robotics and Automation (ICRA), 2014, pp. 15-22.-   [7] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. Leonard, and F.    Dellaert, “iSAM2: Incremental Smoothing and Mapping Using the Bayes    Tree,” The International Journal of Robotics Research (IJRR), vol.    31, no. 2, pp. 216-235, 2012.-   [8] R. lcummerle, G. Grisetti, H. Strasdat, K. Konolige, and W.    Burgard, “g2o: A General Framework for Graph Optimization,” in IEEE    Int. Conf. on Robotics and Automation (ICRA), 2011, pp. 3607-3613.-   [9] M. Jerrum, A. Sinclair, and E. Vigoda, “A polynomial-time    approximation algorithm for the permanent of a matrix with    nonnegative entries,” J. ACM, vol. 51, no. 4, pp. 671-697,    July 2004. [Online]. Available:    http://doi.acm.org/10.1145/1008731.1008738-   [10] W. J. Law, “Approximately counting perfect and general    matchings in bipartite and general graphs,” Ph.D. dissertation, Duke    University, 2009.-   [11] B. N. Vo and W. K. Ma, “The gaussian mixture probability    hypothesis density filter,” IEEE Transactions on Signal Processing,    vol. 54, no. 11, pp. 4091-4104, November 2006.

Although specific examples and features have been described above, theseexamples and features are not intended to limit the scope of the presentdisclosure, even where only a single example is described with respectto a particular feature. Examples of features provided in the disclosureare intended to be illustrative rather than restrictive unless statedotherwise. The above description is intended to cover such alternatives,modifications, and equivalents as would be apparent to a person skilledin the art having the benefit of this disclosure.

The scope of the present disclosure includes any feature or combinationof features disclosed in this specification (either explicitly orimplicitly), or any generalization of features disclosed, whether or notsuch features or generalizations mitigate any or all of the problemsdescribed in this specification. Accordingly, new claims may beformulated during prosecution of this application (or an applicationclaiming priority to this application) to any such combination offeatures. In particular, with reference to the appended claims, featuresfrom dependent claims may be combined with those of the independentclaims and features from respective independent claims may be combinedin any appropriate manner and not merely in the specific combinationsenumerated in the appended claims.

What is claimed is:
 1. A method for simultaneous location and mapping(SLAM), the method comprising: receiving, by at least one processor, aset of sensor measurements from a movement sensor of a mobile robot anda set of images captured by a camera on the mobile robot as the mobilerobot traverses an environment; for each image of at least a subset ofthe set of images, extracting, by the at least one processor, aplurality of detected objects from the image; and estimating, by the atleast one processor, a trajectory of the mobile robot and a respectivesemantic label and position of each detected object within theenvironment using the sensor measurements and an expectationmaximization (EM) algorithm.
 2. The method of claim 1, wherein using theEM algorithm comprises iteratively solving for a data associationdistribution for a plurality of data associations between the detectedobjects and the semantic labels and positions of the detected objectsusing a matrix permanent algorithm in an expectation step of the EMalgorithm.
 3. The method of claim 2, wherein using the EM algorithmcomprises iteratively solving for respective mobile robot positions andsemantic labels and positions of detected objects using the dataassociation distribution in a maximization step of the EM algorithm. 4.The method of claim 3, wherein using the EM algorithm comprisesiterating between the expectation step and the maximization step untilan end condition is reached.
 5. The method of claim 1, whereinestimating the trajectory of the mobile robot and the semantic label andposition of each detected object comprises, for each image of the subsetof the set of images, extracting a plurality of geometric pointfeatures.
 6. The method of claim 5, comprising tracking the geometricpoint features across the set of images.
 7. The method of claim 1,wherein estimating the trajectory of the mobile robot and the semanticlabel and position of each detected object comprises constructing a posegraph comprising a first plurality of vertices for a plurality of mobilerobot poses and a second plurality of vertices for a plurality ofdetected object positions.
 8. The method of claim 7, wherein estimatingthe trajectory of the mobile robot and the semantic label and positionof each detected object comprises determining a plurality of soft dataassociations between mobile robot poses and detected object positionsusing, for each soft data association, a plurality of factors.
 9. Themethod of claim 8, wherein the plurality of factors comprise semanticfactors, geometric factors, and movement sensor factors.
 10. The methodof claim 1, wherein receiving the set of sensor measurements from themovement sensor comprises receiving accelerometer and gyroscopemeasurements from an inertial measurement unit of the mobile robot, andwherein receiving the set of sensor measurements and the set of imagescomprises time synchronizing the sensor measurements and the images. 11.A system for simultaneous location and mapping (SLAM), the systemcomprising: at least one processor; and a probabilistic SLAM mapperimplemented on the at least one processor and configured for: receivinga set of sensor measurements from a movement sensor of a mobile robotand a set of images captured by a camera on the mobile robot as themobile robot traverses an environment; for each image of at least asubset of the set of images, extracting a plurality of detected objectsfrom the image; and estimating a trajectory of the mobile robot and arespective semantic label and position of each detected object withinthe environment using the sensor measurements and an expectationmaximization (EM) algorithm.
 12. The system of claim 11, wherein usingthe EM algorithm comprises iteratively solving for a data associationdistribution for a plurality of data associations between the detectedobjects and the semantic labels and positions of the detected objectsusing a matrix permanent algorithm in an expectation step of the EMalgorithm.
 13. The system of claim 12, wherein using the EM algorithmcomprises iteratively solving for respective mobile robot positions andsemantic labels and positions of detected objects using the dataassociation distribution in a maximization step of the EM algorithm. 14.The system of claim 13, wherein using the EM algorithm comprisesiterating between the expectation step and the maximization step untilan end condition is reached.
 15. The system of claim 11, whereinestimating the trajectory of the mobile robot and the semantic label andposition of each detected object comprises, for each image of the subsetof the set of images, extracting a plurality of geometric pointfeatures.
 16. The system of claim 15, configured for tracking thegeometric point features across the set of images.
 17. The system ofclaim 11, wherein estimating the trajectory of the mobile robot and thesemantic label and position of each detected object comprisesconstructing a pose graph comprising a first plurality of vertices for aplurality of mobile robot poses and a second plurality of vertices for aplurality of detected object positions.
 18. The system of claim 17,wherein estimating the trajectory of the mobile robot and the semanticlabel and position of each detected object comprises determining aplurality of soft data associations between mobile robot poses anddetected object positions using, for each soft data association, aplurality of factors.
 19. The system of claim 18, wherein the pluralityof factors comprise semantic factors, geometric factors, and movementsensor factors.
 20. The system of claim 11, wherein receiving the set ofsensor measurements from the movement sensor comprises receivingaccelerometer and gyroscope measurements from an inertial measurementunit of the mobile robot, and wherein receiving the set of sensormeasurements and the set of images comprises time synchronizing thesensor measurements and the images.
 21. A non-transitory computerreadable medium storing executable instructions that when executed by atleast one processor of a computer control the computer to performoperations comprising: receiving a set of sensor measurements from amovement sensor of a mobile robot and a set of images captured by acamera on the mobile robot as the mobile robot traverses an environment;for each image of at least a subset of the set of images, extracting aplurality of detected objects from the image; and estimating atrajectory of the mobile robot and a respective semantic label andposition of each detected object within the environment using the sensormeasurements and an expectation maximization (EM) algorithm.
 22. Amobile robot comprising: a movement sensor; a camera; and at least oneprocessor configured for: receiving a set of sensor measurements fromthe movement sensor and a set of images captured by the camera as themobile robot traverses an environment; for each image of at least asubset of the set of images, extracting a plurality of detected objectsfrom the image; and estimating a trajectory of the mobile robot and arespective semantic label and position of each detected object withinthe environment using the sensor measurements and an expectationmaximization (EM) algorithm.